12. Epsilon-Greedy Policies

## Note

In the video above, you learned about \epsilon-greedy policies.

You can think of the agent who follows an \epsilon-greedy policy as always having a (potentially unfair) coin at its disposal, with probability \epsilon of landing heads. After observing a state, the agent flips the coin.

  • If the coin lands tails (so, with probability 1-\epsilon), the agent selects the greedy action.
  • If the coin lands heads (so, with probability \epsilon), the agent selects an action uniformly at random from the set of available (non-greedy AND greedy) actions.
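The coin-flip procedure above can be sketched in a few lines of Python. This is a minimal illustration, not library code; `Q_s` is a hypothetical list of action-value estimates for a single state, with actions indexed 0, 1, 2, ….

```python
import random

def epsilon_greedy_action(Q_s, epsilon):
    """Select an action from one state using the coin-flip view.

    Q_s: hypothetical list of action-value estimates Q(s, a) for state s.
    epsilon: exploration probability in [0, 1].
    """
    if random.random() < epsilon:
        # Coin lands heads: pick uniformly from ALL actions (greedy included).
        return random.randrange(len(Q_s))
    # Coin lands tails: pick the greedy action (the index maximizing Q_s).
    return max(range(len(Q_s)), key=lambda a: Q_s[a])
```

Note that the "heads" branch samples from every action, so even when the coin lands heads the greedy action can still be selected.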

In order to construct a policy \pi that is \epsilon-greedy with respect to the current action-value function estimate Q, we will set

\pi(a|s) \longleftarrow \begin{cases} \displaystyle 1-\epsilon +\epsilon/|\mathcal{A}(s)|& \textrm{if }a\textrm{ maximizes }Q(s,a)\\ \displaystyle \epsilon/|\mathcal{A}(s)| & \textrm{else} \end{cases}

for each s\in\mathcal{S} and a\in\mathcal{A}(s).

Here, \mathcal{A}(s) is the set of all actions available in state s (for example, 'up', 'down', 'left', 'right'), and |\mathcal{A}(s)| is the number of available actions (including the greedy one!). The reason we include the extra term \epsilon/|\mathcal{A}(s)| for the greedy action is that all the probabilities must sum to 1. Summing the probabilities of the non-greedy actions gives (|\mathcal{A}(s)|-1)\times \epsilon/|\mathcal{A}(s)|, and adding this to 1-\epsilon + \epsilon/|\mathcal{A}(s)| yields 1.

Note that \epsilon must always be a value between 0 and 1, inclusive (that is, \epsilon \in [0,1]).
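The formula above can also be checked numerically. The sketch below builds the probabilities \pi(a|s) for a single state from a hypothetical list of action-value estimates, following the two-case rule directly (it assumes a single maximizing action, as the formula does).

```python
def epsilon_greedy_probs(Q_s, epsilon):
    """Return pi(a|s) for each action a, per the epsilon-greedy formula.

    Q_s: hypothetical list of action-value estimates Q(s, a) for state s.
    epsilon: exploration probability in [0, 1].
    """
    n = len(Q_s)  # |A(s)|, the number of available actions
    greedy = max(range(n), key=lambda a: Q_s[a])
    probs = [epsilon / n] * n        # epsilon/|A(s)| for every action
    probs[greedy] += 1 - epsilon     # extra 1 - epsilon for the greedy action
    return probs
```

For example, with three actions and \epsilon = 0.3, each non-greedy action gets probability 0.1 and the greedy action gets 1 - 0.3 + 0.1 = 0.8, and the probabilities sum to 1 as required.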

In this quiz, you will answer a few questions to test your intuition.

## Quiz

For each question below, the candidate values are (1) \epsilon = 0, (2) \epsilon = 0.3, (3) \epsilon = 0.5, (4) \epsilon = 1, and (5) none of these values.

Which of the values for epsilon yields an epsilon-greedy policy that is guaranteed to always select the greedy action? Select all that apply.

SOLUTION:
  • (1) epsilon = 0

Which of the values for epsilon yields an epsilon-greedy policy that is guaranteed to always select a non-greedy action? Select all that apply.

SOLUTION:
  • (5) This is a trick question! The *true answer* is that none of the values for epsilon satisfy this requirement.

Which of the values for epsilon yields an epsilon-greedy policy that is equivalent to the equiprobable random policy (where, from each state, each action is equally likely to be selected)?

SOLUTION:
  • (4) epsilon = 1

Which of the values for epsilon yields an epsilon-greedy policy where the agent has the possibility of selecting a greedy action, but might select a non-greedy action instead? In other words, how might you guarantee that the agent selects each of the available (greedy and non-greedy) actions with nonzero probability?

SOLUTION:
  • (2) epsilon = 0.3
  • (3) epsilon = 0.5
  • (4) epsilon = 1